AI-Driven Real-Time Emotion Detection Using an Explainable CNN-GCN Hybrid Model

Authors: Priyanka Kudithetti, B. Kranthi Kiran

DOI Link: https://doi.org/10.22214/ijraset.2026.83615

Abstract

Facial emotion recognition is an important component of intelligent systems and has applications in areas such as human–computer interaction, healthcare, surveillance, and behavioral analysis. Although deep learning techniques have achieved significant progress in this field, many existing approaches mainly rely on facial appearance features and often overlook the structural relationships between facial landmarks, which can affect their performance in real-time environments. To address this challenge, this study presents an AI-driven real-time emotion recognition framework based on a hybrid Convolutional Neural Network (CNN) and Graph Convolutional Network (GCN). The proposed model combines a pretrained ResNet18 network for extracting facial appearance features with a GCN module that captures geometric relationships from 468 facial landmarks obtained using MediaPipe Face Mesh. To improve feature learning and model robustness, geometric loss and adversarial feature regularization are incorporated into the training process. The framework was trained and evaluated on the CK+ dataset containing seven emotion categories and compared with several state-of-the-art models, including VGG16, VGG19, ResNet18, MobileNetV3, ConvNeXt, and EfficientNet-B0. Experimental results show that the proposed CNN-GCN model achieved an accuracy of 99.2%, while ResNet18 and EfficientNet-B0 achieved the highest accuracy of 99.6%. To support practical deployment, the system was implemented as a Flask-based web application capable of real-time webcam and image-based emotion recognition. In addition, explainable AI techniques, including Grad-CAM and LIME, were integrated to provide visual insights into model predictions. The proposed framework offers an accurate, interpretable, and reliable solution for real-time emotion-aware intelligent applications.

Introduction

Facial Emotion Recognition (FER) is an important area of artificial intelligence, computer vision, and human-computer interaction that aims to automatically identify human emotions from facial expressions. FER has applications in healthcare, intelligent transportation, education, security, virtual environments, and emotion-aware user interfaces. Recent advances in deep learning have significantly improved emotion recognition accuracy by enhancing feature extraction and classification. However, many existing systems still face challenges such as poor performance under varying lighting, pose changes, occlusions, expression intensity variations, lack of interpretability, vulnerability to adversarial attacks, and unstable real-time predictions.

To address these limitations, this study proposes a real-time facial emotion recognition framework that combines visual appearance analysis with modeling of the geometric relationships between facial landmarks. The framework is designed to improve recognition accuracy, robustness, interpretability, and deployment efficiency in real-time environments, and is intended to support emotion-aware systems for healthcare monitoring, behavior analysis, smart learning, transportation, and security applications.

Related Work

Previous studies have applied deep learning, generative adversarial networks, occlusion-adaptive models, and real-time emotion analytics systems for facial emotion recognition. Researchers have also explored explainable AI methods such as Grad-CAM and LIME to improve transparency in model decisions. Facial landmark extraction techniques like MediaPipe Face Mesh have enhanced facial representation accuracy, while datasets such as CK+ have become standard benchmarks. Despite these advances, challenges remain in integrating geometric facial information, interpretability, prediction stability, and robustness in real-time systems.

Proposed Methodology

The proposed framework combines:

Convolutional Neural Networks (CNNs) for extracting facial appearance features.
Graph Convolutional Networks (GCNs) for modeling geometric relationships among facial landmarks.
Explainable AI (XAI) techniques using Grad-CAM and LIME.
Flask-based web deployment for real-time webcam and image-based emotion recognition.

The system jointly analyzes facial appearance and structural facial geometry to improve emotion recognition reliability and transparency.
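As a rough illustration of how a GCN module can pass information between connected landmarks, the sketch below applies one graph-convolution step in the widely used normalized form ReLU(D^(-1/2) (A + I) D^(-1/2) H W). The three-node chain graph, the identity weight matrix, and all function names are assumptions chosen for readability; the graph built over the 468 landmarks, the layer sizes, and the learned weights of the actual model are not specified in this summary.

```python
import math


def normalize_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2) for a square 0/1 adjacency matrix."""
    n = len(adj)
    # Add self-loops so every landmark also keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(adj, features, weights):
    """One graph-convolution step: ReLU(A_norm @ H @ W)."""
    propagated = matmul(normalize_adjacency(adj), features)
    return [[max(0.0, v) for v in row] for row in matmul(propagated, weights)]


# Toy graph (hypothetical): 3 landmarks in a chain 0-1-2 with (x, y) features.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
coords = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # identity, so only the propagation is visible

out = gcn_layer(adj, coords, weights)
for row in out:
    print([round(v, 3) for v in row])
```

With identity weights, each output row is a degree-normalized blend of a landmark's own coordinates and those of its neighbours, which is the mechanism that lets a GCN encode local facial structure.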

Data and Preprocessing

The framework uses facial emotion datasets containing seven emotion categories:

Anger
Disgust
Fear
Happiness
Sadness
Surprise
Neutral (or Contempt in CK+)

Preprocessing steps include:

Dataset organization and partitioning into training, validation, and testing sets.
Class imbalance mitigation using weighted sampling.
Image normalization and resizing.
Data augmentation through rotation, flipping, and color adjustments.
Extraction of 468 facial landmarks using MediaPipe Face Mesh.
Geometric feature engineering to capture facial structure and expression changes.
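Two of the steps above can be sketched in a few lines. The first function computes per-sample weights inversely proportional to class frequency, one common way to implement weighted sampling. The second turns landmark coordinates into scale-invariant pairwise distances, one simple form of geometric feature. The function names, the choice of reference landmarks, and the toy data are assumptions for illustration; the exact weighting scheme and geometric features used in the paper are not described in this summary.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-sample weight = 1 / (number of samples sharing that label)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def normalized_distances(landmarks, ref_a, ref_b):
    """Pairwise landmark distances divided by a reference distance.

    Dividing by the distance between two reference landmarks makes the
    features independent of how large the face appears in the image.
    """
    scale = math.dist(landmarks[ref_a], landmarks[ref_b])
    if scale == 0:
        raise ValueError("reference landmarks coincide")
    n = len(landmarks)
    return [math.dist(landmarks[i], landmarks[j]) / scale
            for i in range(n) for j in range(i + 1, n)]


# Hypothetical imbalanced label list: three "happy" samples, one "fear".
labels = ["happy", "happy", "happy", "fear"]
sample_weights = inverse_frequency_weights(labels)

# Hypothetical 2D landmarks; indices 0 and 1 serve as the reference pair.
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
features = normalized_distances(points, 0, 1)
```

With these weights each class carries the same total sampling mass, so a sampler that draws in proportion to them (for example PyTorch's `WeightedRandomSampler`) sees minority classes about as often as majority ones.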

Models Evaluated

Several deep learning architectures were tested:

CNN-GCN Hybrid Model
VGG16
VGG19
ResNet18
MobileNetV3
ConvNeXt
EfficientNet-B0

The CNN-GCN architecture combines facial appearance and geometry, while lightweight models such as MobileNetV3 suit resource-constrained real-time applications.

Explainability and Deployment

The framework integrates:

Grad-CAM to highlight important facial regions influencing predictions.
LIME to generate local explanations for emotion classifications.
A Flask web application supporting:
- User authentication
- Image uploads
- Webcam-based live emotion detection
- Real-time prediction visualization

These features improve user trust, transparency, and usability.
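To make the Grad-CAM step concrete, the sketch below implements its core computation on toy data: each feature map is weighted by the spatial mean of the class-score gradient, the weighted maps are summed, and a ReLU keeps only positively contributing regions. The 2x2 maps and the function name are assumptions for illustration, and the final rescaling to [0, 1] is a common visualization convenience rather than part of the method itself. In a real model the activations and gradients would be captured from the last convolutional layer.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap.

    activations, gradients: lists of K feature maps, each an H x W nested list.
    gradients[k][i][j] is d(class score) / d(activations[k][i][j]).
    """
    # Channel importance: spatial mean of the gradient for each feature map.
    alphas = [sum(sum(row) for row in g) / (len(g) * len(g[0]))
              for g in gradients]
    h, w = len(activations[0]), len(activations[0][0])
    heatmap = [[0.0] * w for _ in range(h)]
    for alpha, fmap in zip(alphas, activations):
        for i in range(h):
            for j in range(w):
                heatmap[i][j] += alpha * fmap[i][j]
    # ReLU: keep only regions that push the class score up.
    heatmap = [[max(0.0, v) for v in row] for row in heatmap]
    # Rescale to [0, 1] for display (skipped if the map is all zeros).
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[v / peak for v in row] for row in heatmap]
    return heatmap


# Toy example (hypothetical): two 2x2 feature maps.
activations = [[[1, 0], [0, 0]],   # map 0 fires at the top-left
               [[0, 0], [0, 2]]]   # map 1 fires at the bottom-right
gradients = [[[1, 1], [1, 1]],     # map 0 supports the class
             [[-1, -1], [-1, -1]]] # map 1 argues against it
heat = grad_cam(activations, gradients)
```

Here only the top-left cell survives, because the second map has a negative average gradient and its contribution is removed by the ReLU.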

Experimental Results

Performance was evaluated using:

Accuracy
Precision
Recall
F1-Score
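For reference, the four metrics can be computed from true and predicted labels as sketched below. Macro averaging (an unweighted mean over classes) is an assumption here, since this summary does not state which averaging scheme was used for the multi-class precision, recall, and F1-score; the function name and toy labels are likewise illustrative.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        # Guard against empty denominators by reporting 0.0 for that class.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(classes)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical labels: one "happy" sample is misclassified as "fear".
y_true = ["happy", "happy", "fear", "fear"]
y_pred = ["happy", "fear", "fear", "fear"]
scores = macro_metrics(y_true, y_pred)
```

On an imbalanced dataset, macro-averaged scores can differ noticeably from accuracy, which is why reporting all four metrics is more informative than accuracy alone.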

All evaluated models reached at least 96.7% accuracy:

Model	Accuracy
ResNet18	99.6%
EfficientNet-B0	99.6%
CNN-GCN	99.2%
MobileNetV3	98.8%
VGG16	98.0%
ConvNeXt	97.6%
VGG19	96.7%

ResNet18 and EfficientNet-B0 achieved the highest accuracy (99.6%), indicating strong feature-learning capability on the CK+ dataset, while the proposed CNN-GCN model followed closely at 99.2%.

Conclusion

Facial emotion recognition plays a crucial role in enabling intelligent human–computer interaction, healthcare support, behavioral analysis, and automated monitoring applications. In this work, a real-time emotion recognition framework was developed by combining a pretrained ResNet18-based Convolutional Neural Network (CNN) with a Graph Convolutional Network (GCN) to capture both facial appearance features and the structural relationships among 468 facial landmarks extracted using MediaPipe Face Mesh. The model was trained and evaluated on the CK+ dataset, which consists of seven emotion categories. To improve feature representation and classification performance, a hybrid loss function combining cross-entropy loss and geometric loss was employed, while adversarial feature regularization was used to enhance the model's robustness. Explainable AI techniques, including Grad-CAM and LIME, were integrated to make the predictions more transparent and interpretable. The framework was further deployed as a Flask-based web application capable of both real-time webcam-based and image-based emotion recognition. In the experiments, the CNN-GCN model achieved an accuracy of 99.2%, while ResNet18 and EfficientNet-B0 achieved the highest accuracy of 99.6%. Overall, the proposed system offers an accurate, reliable, and interpretable solution for real-time emotion-aware intelligent applications.

Copyright

Copyright © 2026 Priyanka Kudithetti, B. Kranthi Kiran. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET83615

Publish Date : 2026-06-12

ISSN : 2321-9653

Publisher Name : IJRASET
